Abstract - Data Mining is a process that involves automatic or semi-automatic analysis of large quantities of data to extract previously unknown interesting patterns. It gathers knowledge which was hidden earlier. It involves various processes, of which classification, association rule mining and clustering gain major attention. One of the emerging application areas of Data Mining is Social Networks. The focus of the research is on framing classification rules to predict the patterns in installation/usage of Facebook applications for the most popularly installed/used applications. The Dataset used in this research is the Facebook Application installation/usage Dataset, which contains details of installation of nearly 16,800 applications among 3 lakh (300,000) users. The work begins with Data Preprocessing, where the installation/usage of the top 10 applications (selected based on the count of installations made by users) was used for processing. Various Data Mining Classification Algorithms such as RndTree, ID3, C-RT, CS-CRT, C4.5, CS-MC4, Decision List and Naive Bayes are applied to the preprocessed data individually and analyzed, and the Classification rules for predicting the installation/usage of popular applications are identified. The training phase is processed with training data and the testing phase is tested with test data.

Keywords - Data Mining; Algorithms; Applications; Social Network; Prediction; Facebook; error rates; Classification Rules.

I. Introduction 

Data Mining makes use of data analysis tools to identify patterns and relationships in voluminous datasets. Data Mining Applications use classification, clustering, prediction, association rule mining, pattern recognition and pattern analysis. Data Mining has found its application in a variety of areas, among which Social Networks play a major role. Social networks have become omnipresent in today's world. They pave the way to share information among any number of people all over the world. Many Online Social Networks exist, some of which include Orkut, Facebook, Friendster, Myspace, etc. Facebook gains prominence over these by holding a record of maximum usage among users. With more than 845 million active users around the world as of February 2012, Facebook is today's most prominent social utility to connect with diverse audiences, including friends, family, co-workers, constituents, and consumers. These connections occur not just through Facebook features but through applications ("apps") developed by third parties over the Facebook Platform.




A. Background of Facebook Applications 

Facebook alone has over 81,000 third-party applications [5]. Facebook users install many applications through developer platforms. The Facebook Developer Platform was launched in May 2007 [14] with little elaboration and only about eight applications in schedule. As months passed the Platform showed rapid growth, with more than 35,000 applications by July 2008 [14]. The first step in creating a Facebook application requires the developer to register the application with Facebook. Each application is assigned an application-id and a private application key. All communication between the application and Facebook's servers has to be signed with this key. A user can install an application by visiting the application's landing page and accepting the dialog specifying the access rights of the application. However, the user can only accept or cancel the dialog. It is not possible to selectively grant or deny access to individual items of profile information.


This paper presents our research work on the analysis of installation/usage of various Facebook applications by the active users and reveals the best classification algorithm for classifying the usage of the top ten applications. The classification rules obtained can be used in predicting the usage/installation of a particular application in future. This paper uses the Facebook Application Dataset provided by Minas Gjoka and his team.

The original Dataset consists of two subdivisions. The first subdivision includes a data set that consists of data obtained from Adonomics [4], a service based on statistics reported by FB, over a period of 6 months from Sept. 2007 until Feb. 2008. It gives a detailed description of nearly 16,800 applications, the number of installations of each application and the number of users who use the application at least once during a day, called Daily Active Users (DAU). The second data set gives details of the Facebook user profile with the various applications used by each user. Our work focuses on the second dataset for the purpose of finding classification rules for the installation/usage of the top ten applications.

B. Organization of the Paper 

The rest of the paper is organized as follows. Section 2 presents the related work in this area. Section 3 describes the data mining framework and the details of the Dataset used in this research. It also briefs about the various classification algorithms that are applied to this dataset. Experimental results are discussed in Section 4, while Section 5 concludes the paper.

II. Related Work

The work carried out so far by other researchers that is related to Facebook data is concisely presented here. However, we wish to state that no previous research has targeted the Facebook application dataset that we have used in our research.

Three Facebook applications were developed and launched, which have achieved a combined subscription base of over 8 million users. These were used to explore the existence of 'communities', with a high degree of interaction within a community and limited interaction outside the community, within the context of Facebook applications [12], [13].




Wei Pan and co-authors proposed a computational model to predict mobile application (known as "apps") installation in social networks and explained the challenges involved in their work. They show the importance of considering many factors in predicting app installations, and observed the surprising result that app installation was indeed predictable [18].



The Facebook application context allows studying social influence processes by tracking the popularity of a complete set of applications installed by the user population of a social networking site. This captures the behavior of all individuals who can influence each other in this context. By extending standard fluctuation scaling methods, the collective behavior induced by 100 million application installations was analyzed, revealing that two distinct regimes of behavior emerge in the system [16].

A client-side proxy system was proposed that provides a Facebook user with fine-grained access control capabilities over which parts of his/her private profile information can be accessed by third-party applications [17].

D. E. Brown, V. Corruble, and C. L. Pittard [6] compared decision tree classifiers with back propagation
neural networks for multimodal classification problems. J. Catlett [7] has explained how knowledge 
patterns can be generated from large databases. M. James [8] in his book describes the various classification 
algorithms. T. Cover and P. Hart [9] performed classification using K-NN and proved its accuracy. 






III. Data Mining Framework 

This section gives a brief description of the overall system design and the Dataset used in this research. The overall design of the proposed system is given in Figure 1 and each of the components is addressed briefly in further sections. The design framework for the classification of Facebook Application usage/installation comprises the training phase, which incorporates the process of training data selection, data pre-processing and generation of classification rules through classification algorithms. This is followed by an Evaluation phase wherein the classifiers are evaluated based on their error rates. The Test phase verifies the chosen classifier's accuracy in classifying unseen Application data.




Figure 1. Overall System Design 



A. Dataset Description

The Dataset utilized for this research is the Facebook Application dataset. The original Dataset consists of two subdivisions. The first data set consists of a detailed description of nearly 16,800 applications, the number of installations/usage of each application and the number of users who use the application at least once during a day, called Daily Active Users (DAU). The second data set gives details of the Facebook user profile with the various applications installed/used by each user. This work uses only the second dataset, which contains a list of installed/used applications for 297K Facebook users. UserIDs are anonymised. The Dataset is of the form shown in Table I.

Table I Dataset Under Study

Data Source        Period                  Data Element
FB User Profile    20/02/08 to 27/02/08    Users list, applications installed

A sample dataset is shown in Table II.

Table II A Sample Dataset

User id    App1           App2           ...    App773
1          2339854854     22800106120    ...    5954997258
2          5902932866     8123226859     ...    3361908998
3          6280837251     5737540558     ...
4          2363570816     2424357634     ...
5          17501549056    5902932866     ...


B. Data Preprocessing 

The original Dataset shows the installation/usage of various applications by the users. The number of applications installed/used by a user ranges from 3 to 773. The data are preprocessed by identifying the top ten applications (based on the number of installations) among the users. This research work focuses on exploiting information about the classification rules in predicting the usage of the top ten Facebook applications.
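
The preprocessing step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually run for the paper: the function names `top_n_apps` and `to_binary_rows` are hypothetical, the `y`/`n` encoding is assumed from the form of the rules shown later (for example "app3 in [y]"), and the toy input reuses the application ids visible in the sample dataset with `n = 2` instead of ten.

```python
from collections import Counter

def top_n_apps(installs, n):
    """Return the n most-installed application ids, most popular first."""
    counts = Counter(app for apps in installs.values() for app in set(apps))
    # Sort by descending count, then by id, so ties are resolved deterministically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [app for app, _ in ranked[:n]]

def to_binary_rows(installs, top_apps):
    """Encode each user as a dict of 'y'/'n' flags, one flag per top application."""
    rows = {}
    for user, apps in installs.items():
        installed = set(apps)
        rows[user] = {app: ("y" if app in installed else "n") for app in top_apps}
    return rows

# Toy input: user id -> list of installed application ids (from the sample dataset).
sample_installs = {
    1: [2339854854, 22800106120, 5954997258],
    2: [5902932866, 8123226859, 3361908998],
    3: [6280837251, 5737540558],
    4: [2363570816, 2424357634],
    5: [17501549056, 5902932866],
}

top_apps = top_n_apps(sample_installs, 2)
rows = to_binary_rows(sample_installs, top_apps)
print(top_apps)
print(rows[2])
```

Each resulting row has one attribute per top application, so any one of them can serve as the class attribute while the remaining ones act as predictors.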



C. Classification Algorithms 



The goal of Classification is to build a set of models that can correctly foresee the class of the different objects [3]. Classification Algorithms such as RndTree, ID3, C-RT, CS-CRT, C4.5, CS-MC4, Decision List and Naive Bayes were applied. The following is a brief outline of some of the Classification Algorithms.



Rnd Tree Algorithm

The classification works as follows [2]: the Random Trees classifier takes the input feature vector, classifies it with every tree in the forest, and outputs the class label that received the majority of "votes". The pseudo code of the algorithm for this domain is given in Figure 2.

Begin
  {FT: collection of all predictor features - forest}
  {IP: input data - feature vector}
  { Compare the Attribute Values (av) of IP with FT.
    If (IP.av == FT.av) then take the positive branch
    Else take the negative branch }
  for all IP until leaf node is reached.
End
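
The voting step can be illustrated with a short Python sketch. It is a simplified illustration, not TANAGRA's implementation: the tuple-based tree representation, the function names and the three hand-written trees are all hypothetical, and the sketch covers only classification with an existing forest, not how the trees are grown.

```python
from collections import Counter

def classify_with_tree(tree, features):
    """Walk one tree. A leaf is a class label (str); an inner node is a tuple
    (attribute, value, positive_branch, negative_branch)."""
    node = tree
    while isinstance(node, tuple):
        attribute, value, positive, negative = node
        # Take the positive branch when the input's attribute value matches the test.
        node = positive if features.get(attribute) == value else negative
    return node

def classify_with_forest(forest, features):
    """Return the class label that receives the most votes across all trees."""
    votes = Counter(classify_with_tree(tree, features) for tree in forest)
    return votes.most_common(1)[0][0]

# Three toy trees (hypothetical; these are not rules from the paper).
forest = [
    ("app3", "y", "y", "n"),
    ("app5", "y", ("app2", "y", "y", "n"), "n"),
    ("app2", "n", "n", "y"),
]

user = {"app2": "n", "app3": "y", "app5": "y"}
prediction = classify_with_forest(forest, user)
print(prediction)
```

An odd number of trees is used here so that a two-class vote cannot tie.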
ID3 (Iterative Dichotomiser 3) Algorithm

It is an Algorithm used to generate a decision tree, invented by Ross Quinlan. ID3 is the precursor to the C4.5 Algorithm. The work flow of the Algorithm is shown in Figure 3.

ID3 (Examples, Target_Attribute, Attributes)
  Create a root node for the tree
  If all examples are positive, Return the single-node tree Root, with label = +.
  If all examples are negative, Return the single-node tree Root, with label = -.
  If number of predicting attributes is empty, then Return the single node tree Root,
    with label = most common value of the target attribute in the examples.
  Otherwise Begin
    A = The Attribute that best classifies examples.
    Decision Tree attribute for Root = A.
    For each possible value, vi, of A,
      Add a new tree branch below Root, corresponding to the test A = vi.
      Let Examples(vi) be the subset of examples that have the value vi for A
      If Examples(vi) is empty
        Then below this new branch add a leaf node with label = most common target value in the examples
      Else below this new branch add the subtree ID3 (Examples(vi), Target_Attribute, Attributes - {A})
  End
  Return Root

Figure 3. ID3 Algorithm

C4.5 Algorithm

It is also called a statistical classifier [2]. The pseudo code of the general Algorithm is as follows: Check for base cases. For each attribute a, find the normalized information gain from splitting on a. Let a_best be the attribute with the highest normalized information gain. Create a decision node that splits on a_best. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of node.
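
The attribute selection step shared by ID3 and C4.5 can be made concrete with a small Python sketch. ID3 ranks attributes by information gain, while the "normalized information gain" of C4.5 is commonly computed as the gain ratio (information gain divided by the entropy of the split itself). The function names and the four-row toy data below are assumptions made for illustration only.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in Counter(values).values())

def information_gain(rows, labels, attribute):
    """Reduction in label entropy obtained by splitting on the attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(rows, labels, attribute):
    """Information gain normalized by the entropy of the attribute's own values."""
    split_info = entropy([row[attribute] for row in rows])
    if split_info == 0:
        return 0.0  # attribute takes a single value, so it cannot split the data
    return information_gain(rows, labels, attribute) / split_info

def best_attribute(rows, labels, attributes, score=gain_ratio):
    """Pick the attribute with the highest score (a_best in the text)."""
    return max(attributes, key=lambda a: score(rows, labels, a))

# Toy data (hypothetical): the class label follows app2 exactly and ignores app3.
rows = [
    {"app2": "y", "app3": "y"},
    {"app2": "y", "app3": "n"},
    {"app2": "n", "app3": "y"},
    {"app2": "n", "app3": "n"},
]
labels = ["y", "y", "n", "n"]

chosen = best_attribute(rows, labels, ["app2", "app3"])
print(chosen, information_gain(rows, labels, "app2"), gain_ratio(rows, labels, "app3"))
```

Passing `score=information_gain` to `best_attribute` gives the ID3 criterion instead of the C4.5 one.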



C-RT & CS-CRT Algorithm

The CART method [2] under Tanagra is a very popular classification tree learning algorithm. CART builds a decision tree by splitting the records at each node according to a function of a single attribute; it uses the Gini index for determining the best split. CS-CRT is similar to CART but with cost-sensitive classification.
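
As a rough illustration of the Gini criterion, the following Python sketch computes the Gini index of a set of labels and the weighted Gini index after splitting on one attribute, then picks the attribute with the lowest value. The function names and toy data are hypothetical, and the sketch omits the pruning and surrogate-split machinery of a full CART implementation.

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_of_split(rows, labels, attribute):
    """Weighted Gini index of the groups produced by splitting on the attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    return sum(len(g) / len(labels) * gini(g) for g in groups.values())

def best_split(rows, labels, attributes):
    """The best split is the one that leaves the least impurity."""
    return min(attributes, key=lambda a: gini_of_split(rows, labels, a))

# Toy data (hypothetical): y/n attributes, so every split is binary.
rows = [
    {"app2": "y", "app3": "y"},
    {"app2": "y", "app3": "n"},
    {"app2": "n", "app3": "y"},
    {"app2": "n", "app3": "n"},
]
labels = ["y", "y", "n", "n"]

print(gini(labels), gini_of_split(rows, labels, "app2"), best_split(rows, labels, ["app2", "app3"]))
```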
CS-MC4 Algorithm

Cost-sensitive decision tree Algorithm [2]. This version uses m-estimate smoothed probability estimation (a generalization of the Laplace estimate). It minimizes the expected loss using a misclassification cost matrix for the detection of the best prediction within leaves. The precondition required for this Algorithm is that at least one discrete attribute (target) and one or more discrete/continuous attributes (input) must be available.
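
The two ideas named above, m-estimate smoothing and prediction by minimum expected loss, can be sketched as follows. This is an illustrative reading of the description, not TANAGRA's code: the function names, the leaf counts, the priors and the cost values are all assumptions. The m-estimate is taken as (n_k + m * prior_k) / (n + m), which reduces to the Laplace estimate when m equals the number of classes and the priors are uniform.

```python
def m_estimate(counts, priors, m):
    """Smoothed class probabilities in a leaf: (n_k + m * prior_k) / (n + m)."""
    total = sum(counts.values())
    return {label: (counts[label] + m * priors[label]) / (total + m) for label in counts}

def min_expected_loss_label(probabilities, cost):
    """Choose the prediction with the smallest expected misclassification cost.
    cost[actual][predicted] is the cost of predicting `predicted` for class `actual`."""
    def expected_loss(predicted):
        return sum(p * cost[actual][predicted] for actual, p in probabilities.items())
    return min(probabilities, key=expected_loss)

# Hypothetical leaf: 8 examples of class n and 2 of class y.
leaf_counts = {"n": 8, "y": 2}
priors = {"n": 0.5, "y": 0.5}
probs = m_estimate(leaf_counts, priors, m=2)  # same as Laplace for two classes

# Hypothetical costs: missing an actual "y" is five times worse than a false "y".
cost = {"n": {"n": 0, "y": 1}, "y": {"n": 5, "y": 0}}
unit_cost = {"n": {"n": 0, "y": 1}, "y": {"n": 1, "y": 0}}

print(probs)
print(min_expected_loss_label(probs, cost), min_expected_loss_label(probs, unit_cost))
```

With unit costs the leaf predicts its majority class; with the asymmetric costs the prediction switches to the minority class because the expected loss of predicting `n` becomes larger.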
IV. Experimental Results

This section presents the analysis and results obtained after executing the various Classification Algorithms. The whole experiment is carried out with the Data Mining tool TANAGRA. The Applications are ranked based on the count of installations and the top ten applications are identified. Classification Algorithms like C4.5, C-RT, CS-CRT, CS-MC4, Decision List, ID3, Naive Bayes and RndTree were applied to the pre-processed Data. The performance of these Algorithms is evaluated based on the error rates. The installation/usage of the Top Ten Applications considered for the work is shown in Table III. The error rates of the various Classification Algorithms are shown in Table IV.

Table III List of Top Ten Applications Installed 



The performances of the Classification Algorithms were evaluated based on the error rates obtained.

Table IV Error Rates of Classification Algorithm for 10 Subsets



Top 10 Applications    Error Rates of Various Classification Algorithms




The error rates of the various classification algorithms are found using a confusion matrix. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. One benefit of a confusion matrix is that it is easy to see if the System is confusing two classes (i.e. commonly mislabeling one as another). A sample Confusion Matrix for the RndTree Algorithm is shown in Figure 4. Of all the Algorithms, the RndTree Algorithm gave the lowest error rates.



Classifier performances

Error rate: 0.2423

Values prediction
Value    Recall    1-Precision
n        0.9995    0.2421
y        0.0017    0.4857

Confusion matrix
         n         y      Sum
n        225261    119    225380
y        71967     126    72093
Sum      297228    245    297473

Figure 4. Confusion Matrix of RndTree Algorithm for Posted items Application
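
The figures reported alongside a confusion matrix follow directly from its cell counts, as the Python sketch below illustrates. It assumes the layout described in the text (rows are actual classes, columns are predicted classes) and reads the four counts of Figure 4 as actual n = (225261, 119) and actual y = (71967, 126); the function name and the dictionary keys are hypothetical.

```python
def classifier_metrics(matrix, labels):
    """Compute error rate, per-class recall and per-class 1-precision.
    matrix[i][j] = number of examples of actual class i predicted as class j."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(labels)))
    metrics = {"error_rate": (total - correct) / total}
    for i, label in enumerate(labels):
        actual_total = sum(matrix[i])                     # row sum
        predicted_total = sum(row[i] for row in matrix)   # column sum
        metrics[label] = {
            "recall": matrix[i][i] / actual_total,
            "one_minus_precision": 1 - matrix[i][i] / predicted_total,
        }
    return metrics

# Counts as read from Figure 4 (rows: actual n, actual y; columns: predicted n, predicted y).
matrix = [[225261, 119], [71967, 126]]
result = classifier_metrics(matrix, ["n", "y"])
print(round(result["error_rate"], 4))  # 0.2423
```

The very low recall for class `y` shows that most of the overall accuracy comes from the majority class `n`, which is worth keeping in mind when comparing classifiers by error rate alone.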



The rules generated by RndTree for classifying the installation/usage of the applications ranked 9 and 1 are shown in Figure 5 and Figure 6, respectively.



journals@asdf.res.in



www.asdfjournals.com 



Page 36 of 89 



Vol. 1; Iss 1; Year 2013 



Intl. Jrnl. on Human Machine Interaction 



[Decision tree rules output by TANAGRA for the Posted items application]

Figure 5. A Snapshot of rule Generated by RndTree Algorithm For Posted items Application

[Decision tree rules output by TANAGRA for the Group application]

Figure 6. A Snapshot of rule Generated by RndTree Algorithm For Group Application

The generated rules were also used to predict the installation/usage of intended applications and tested with test data and found to be correct.

V. Conclusion

Social network analysis (SNA) is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information/knowledge entities. Social Network Data is vast and is used in much research. One such dataset, the "Facebook Application Dataset", is used in this research. There have been a large number of data mining Algorithms rooted in these fields to perform different data analysis tasks. In this paper, the performance of various Data Mining Classification Algorithms in predicting the installation of the top ten Facebook Applications was compared. The classification rules produced by the various Data Mining Classification Algorithms are evaluated based on the error rates. From the results it is clear that, for all the top ten applications considered for the research, the RndTree Algorithm produced lower error rates than all the other Algorithms, and the rules generated by the RndTree Algorithm correctly predicted the installation/usage of the intended application among users. The accuracy is tested with sample test data.